Document Similarity Misjudgment by LSA: Misses vs. False Positives

Authors

Kyung Hun Jung

Eric Ruthruff

Timothy Goldsmith

Abstract

Modeling text document similarity is an important yet challenging task. Even the most advanced computational linguistic models often misjudge document similarity relative to humans. Regarding the pattern of misjudgment between models and humans, Lee and colleagues (2005) suggested that the models’ primary failure is occasional underestimation of strong similarity between documents. According to this suggestion, there should be more extreme misses (i.e., models failing to pick up on strong document similarity) than extreme false positives (i.e., models falsely detecting document similarity that does not exist). We tested this claim by comparing document similarity ratings generated by humans and latent semantic analysis (LSA). Notably, we implemented LSA with 441 unique parameter settings, determined optimal parameters that yielded high correlations with human ratings, and finally identified misses and false positives under the optimal parameter settings. The results showed that, as Lee et al. predicted, large errors were predominantly misses rather than false positives. Potential causes of the misses and false positives are discussed.
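The distinction between misses and false positives drawn in the abstract can be illustrated with a minimal sketch. It assumes that human ratings and model similarities have already been rescaled to a common [0, 1] range, and that a "large" error is one whose absolute difference exceeds a cutoff. The function name `classify_errors`, the `threshold` value of 0.5, and the sample scores are all hypothetical; the abstract does not state how the authors defined or thresholded extreme errors.

```python
def classify_errors(human, model, threshold=0.5):
    """Label each document pair as 'miss', 'false_positive', or 'ok'.

    human, model: equal-length lists of similarity scores, both assumed
        to be scaled to [0, 1] beforehand.
    threshold: hypothetical cutoff for a "large" error (not from the paper).
    """
    if len(human) != len(model):
        raise ValueError("human and model must have the same length")
    labels = []
    for h, m in zip(human, model):
        diff = m - h  # negative: model underestimates; positive: overestimates
        if diff <= -threshold:
            labels.append("miss")            # humans see similarity, model does not
        elif diff >= threshold:
            labels.append("false_positive")  # model sees similarity, humans do not
        else:
            labels.append("ok")
    return labels


# Made-up scores for four document pairs, for illustration only.
human = [0.9, 0.1, 0.8, 0.2]
model = [0.2, 0.15, 0.75, 0.9]
labels = classify_errors(human, model)
print(labels)
```

Under this framing, the claim attributed to Lee et al. corresponds to the count of "miss" labels exceeding the count of "false_positive" labels among large errors.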

Similar resources

Information Retrieval Perspective to Interactive Data Visualization

Dimensionality reduction for data visualization has recently been formulated as an information retrieval task with a well-defined objective function. The formulation was based on preserving similarity relationships defined by a metric in the input space, and explicitly revealed the need for a tradeoff between avoiding false neighbors and missing neighbors on the low-dimensional display. In the ...

The Neural Mechanism of Encountering Misjudgment by the Justice System

Although misjudgment is an issue of primary concern to the justice system and public safety, the response to misjudgment by the human brain remains unclear. We used fMRI to record neural activity in participants that encountered four possible judgments by the justice system with two basic components: whether the judgment was right or wrong [accuracy: right vs. wrong (misjudgment)] and whether t...

An index-based algorithm for fast on-line query processing of latent semantic analysis

Latent Semantic Analysis (LSA) is widely used for finding documents whose semantics are similar to a keyword query. Although LSA yields promising similarity results, the existing LSA algorithms involve many unnecessary operations in similarity computation and candidate checking during on-line query processing, which is expensive in terms of time cost and cannot efficiently respond to the quer...

Predicting False Positives of Protein-Protein Interaction Data by Semantic Similarity Measures

Recent technical advances in identifying protein-protein interactions (PPIs) have generated genome-wide interaction data, collectively referred to as the interactome. These interaction data give insight into the underlying mechanisms of biological processes. However, the PPI data determined by experimental and computational methods include an extremely large number of false...

Implications of Memory Research for Criminal Law Procedure

[Figure 7. Percentage of hits, false positives, and misses as a function of preceding task.] Inspection of the graphs in Figure 7 shows an absence of any consistent pattern. Prior description had no effect on the identification of the philosophy ...

Publication date: 2017
